Differentially Private Heatmaps

ABSTRACT

Improved methods are provided for generating heatmaps or other summary map data from multiple users&#39; data (e.g., probability distributions) in a manner that preserves the privacy of the users&#39; data while also generating heatmaps that are visually similar to the ‘true’ heatmap. These methods include decomposing the average of the users&#39; data (the ‘true’ heatmap) into multiple different spatial scales, injecting random noise into the data at the multiple different spatial scales, and then reconstructing the privacy-preserving heatmap based on the noisy multi-scale representations. The magnitude of the noise injected at each spatial scale is selected to ensure preservation of privacy while also resulting in heatmaps that are visually similar to the ‘true’ heatmap.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a non-provisional patent application claiming priority to U.S. Provisional Patent Application No. 63/221,774, filed Jul. 14, 2021, the contents of which are hereby incorporated by reference.

BACKGROUND

It is beneficial in a variety of applications to generate “heatmaps” or similar summary information about the distribution of something of interest (e.g., users) across as two (or more) dimensional space. For example, a heatmap indicating the most likely locations of attendees within an amusement park could be used to determine where to locate drinking fountains or other facilities, where to allocate maintenance or upgrade efforts, etc. In another example, a heatmap indicating where a user looked at a user interface (e.g., determined vua ‘gaze detection’) could be used to determine which portions of the interface were most useful and/or had the greatest impact on a user's perception. Regardless of the application, such heatmaps are often generated based on private data (e.g., user location data, gaze tracking data). Accordingly, it is necessary to generate heatmaps from such data in a manner that preserves the privacy of the users whose data is used.

SUMMARY

According to a first aspect of the present disclosure, there is provided a computer-implemented method including: (i) receiving an input distribution that is an average of a number of individual distributions, each individual distribution representing data from a respective different entity; (ii) using a multi-scale spatial transform, transforming the input distribution into a multi-scale representation that includes two or more representations of the input distribution at respective different spatial scales; (iii) adding noise to the multi-scale representation to generate an anonymized multi-scale representation, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding a different amount of noise to each of the two or more representations of the input distribution at respective different spatial scales; (iv) based on the anonymized multi-scale representation, reconstructing an output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the anonymized multi-scale representation.

According to a second aspect of the present disclosure, there is provided a computer-implemented method including: (i) receiving an input distribution that is an average of a number of individual distributions, each individual distribution representing data from a respective different entity; (ii) using a multi-scale spatial transform, transforming the input distribution into a multi-scale representation that includes two or more representations of the input distribution at respective different spatial scales; (iii) adding noise to the multi-scale representation to generate an anonymized multi-scale representation, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding a different amount of noise to each of the two or more representations of the input distribution at respective different spatial scales; (iv) sparsifying the anonymized multi-scale representation such that no more than a specified number of elements of any of the two or more representations of the input distribution at respective different spatial scales are non-zero; and (v) based on the sparsified anonymized multi-scale representation, reconstructing an output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the sparsified anonymized multi-scale representation.

According to a third aspect of the present disclosure, a computing device is provided that includes one or more processors, wherein the one or more processors are configured to perform the method of either of the first or second aspects above.

According to a fourth aspect of the present disclosure, an article of manufacture including a non-transitory computer-readable medium is provided, having stored thereon program instructions that, upon execution by a computing device, cause the computing device to perform operations to effect the method of either of the first or second aspects above.

BRIEF DESCRIPTION OF THE FIGURES

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

FIG. 1 illustrates aspects of an example system.

FIG. 2 illustrates a flowchart of an example method.

FIG. 3 illustrates a flowchart of an example method.

FIG. 4 illustrates experimental results, according to example methods.

FIG. 5 illustrates experimental results, according to example methods.

FIG. 6 illustrates experimental results, according to example methods.

DETAILED DESCRIPTION

The following detailed description describes various features and functions of the disclosed systems and methods with reference to the accompanying figures. The illustrative system and method embodiments described herein are not meant to be limiting. It may be readily understood that certain aspects of the disclosed systems and methods can be arranged and combined in a wide variety of different configurations, all of which are contemplated herein.

I. Overview

It is desirable in a variety of contexts to provide “heat maps” of the level of activity or some other factor of interest across a two-dimensional area (or other-dimensional space). For example, it can be desirable to provide a heat map that shows where people, devices, or other entities are likely to go within a store or other space based on GPS tracks and/or location check-ins from individual people or other individual entities. In another example, it can be desirable to provide a heat map that shows where viewers are likely to look at a display or other target of interest based on detected eye gaze data. In another example, it can be desirable to provide a heatmap to show where viewers are likely to look at/interact with a user interface based on cursor track, button press locations, or other information. Heatmaps could be generated for gene microdata (e.g., to illustrate the distribution of different genotypes) based on gene sequences or allele vectors for individual people. However, because such heat maps are often generated based on private data aggregated from many users, devices, or other entities, it is necessary to “anonymize” the heat maps such that the private data of any individual entity, or specified group of entities, cannot be recovered, even in scenarios where the heat maps may be generated based on queries generated by users, devices, or other entities.

The methods described herein can be used to generate such heatmaps (or to generate distributions that can then be smoothed, upsampled, or otherwise processed to generated such heatmaps) in a manner that mathematically provably preserves the privacy of the individuals' source data. These methods also provide distributions that “look” correct, since the methods generate anonymized output distributions that approximate the non-anonymized input distributions with respect to the “earth mover's distance” (EMD) between the input and output distributions while still ensuring that the output distribution cannot be used to recover the private source data used to generate the input distribution.

These benefits are achieved by transforming the input distribution into a multi-scale representation (e.g., by applying a pyramidal transform) and then adding noise to the multi-scale representation in a manner that is dependent upon spatial scale. So, for example, aspects of the multi-scale representation that represent the input distribution at a low spatial scale will have lower-magnitude noise (e.g., lower-variance Laplacian-distributed noise) added thereto than aspects of the multi-scale representation that represent the input distribution at a higher spatial scale, which will have added thereto higher-magnitude noise (e.g., higher-variance Laplacian-distributed noise). The noisy multi-scale representation can then be used to reconstruct an output distribution by attempting to generate an output distribution (e.g., using linear programming, gradient descent, etc.) that, when transformed into the multi-scale representation, approximates the noisy multi-scale representation (e.g., with respect to

1 distance between the representations) according to the EMD, thus appearing perceptually similar to the non-anonymized input distribution while preventing private data (e.g., individual location check-in distributions) from being determined therefrom.

This method can be made more efficient and/or computationally tractable with respect to compute cycles and memory use by “sparsifying” the noisy multi-scale representation prior to reconstruction. This “sparsification” process includes removing low-value elements of the noisy multi-scale representation at each spatial scale. For example, removing low-value elements from each spatial scale of the noisy multi-scale representation such that no more than a maximum number of elements remain non-zero in each spatial scale of the noisy multi-scale representation following the sparsification. This allows the subsequent reconstruction process to operate on significantly fewer non-zero elements, thereby reducing the cycle cost and memory footprint of the reconstruction.

With respect to embodiments that include receiving private entity data (e.g., data that is private to a specific user, a specific user in a specific session, a specific device, or some other specific entity) in order to generate heatmaps (e.g., by taking the mean of such data and then adding noise thereto so as to anonymize the mean), interactions by an entity's computing device with cloud-based servers, or otherwise involving sharing captured information (e.g., GPS tracks) with other computing devices, an entity may be provided with controls allowing the entity to make an election as to both if and when systems, programs, or features described herein may enable collection of entity information (e.g., information about an entity's location, gaze direction, interaction with user interface elements), and if the entity is sent content or communications from a server. In addition, certain data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, an entity's identity may be treated so that no personally identifiable information can be determined for the entity, or an entity's geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of an entity cannot be determined. Thus, the entity may have control over what information is collected about the entity, how that information is used, and what information is provided to the entity.

II. Illustrative Systems

FIG. 1 illustrates an example computing device 100 that may be used to implement the methods described herein. By way of example and without limitation, computing device 100 may be a cellular mobile telephone (e.g., a smartphone), a computer (such as a desktop, notebook, tablet, or handheld computer, a server), elements of a cloud computing system, a robot, a drone, an autonomous vehicle, or some other type of device. It should be understood that computing device 100 may represent a physical computing device such as a server, a particular physical hardware platform on which a machine learning application operates in software, or other combinations of hardware and software that are configured to carry out machine learning functions as described herein.

As shown in FIG. 1 , computing device 100 may include a communication interface 102, a user interface 104, a processor 106, and data storage 108, all of which may be communicatively linked together by a system bus, network, or other connection mechanism 110.

Communication interface 102 may function to allow computing device 100 to communicate, using analog or digital modulation of electric, magnetic, electromagnetic, optical, or other signals, with other devices, access networks, and/or transport networks. Thus, communication interface 102 may facilitate circuit-switched and/or packet-switched communication, such as plain old telephone service (POTS) communication and/or Internet protocol (IP) or other packetized communication. For instance, communication interface 102 may include a chipset and antenna arranged for wireless communication with a radio access network or an access point. Also, communication interface 102 may take the form of or include a wireline interface, such as an Ethernet, Universal Serial Bus (USB), or High-Definition Multimedia Interface (HDMI) port. Communication interface 102 may also take the form of or include a wireless interface, such as a Wifi, BLUETOOTH®, global positioning system (GPS), or wide-area wireless interface (e.g., WiMAX or 3GPP Long-Term Evolution (LTE)). However, other forms of physical layer interfaces and other types of standard or proprietary communication protocols may be used over communication interface 102. Furthermore, communication interface 102 may comprise multiple physical communication interfaces (e.g., a Wifi interface, a BLUETOOTH® interface, and a wide-area wireless interface).

In some embodiments, communication interface 102 may function to allow computing device 100 to communicate, with other devices, remote servers, access networks, and/or transport networks. For example, the communication interface 102 may function to access one or more privacy-preserving algorithms as described herein for generating heatmaps or distributions from entity data (e.g., location tracks, location check-ins, gaze tracks, cursor tracks) and/or input therefor via communication with a remote server or other remote device or system in order to allow the computing device 100 to use the algorithms to generate outputs (e.g., anonymized heatmaps or distributions) based on input data.

User interface 104 may function to allow computing device 100 to interact with a user or other entity, for example to receive input from and/or to provide output to the user. Thus, user interface 104 may include input components such as a keypad, keyboard, touch-sensitive or presence-sensitive panel, computer mouse, trackball, joystick, microphone, and so on. User interface 104 may also include one or more output components such as a display screen which, for example, may be combined with a presence-sensitive panel. The display screen may be based on CRT, LCD, and/or LED technologies, or other technologies now known or later developed. User interface 104 may also be configured to generate audible output(s), via a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices.

Processor 106 may comprise one or more general purpose processors—e.g., microprocessors—and/or one or more special purpose processors—e.g., digital signal processors (DSPs), graphics processing units (GPUs), floating point units (FPUs), network processors, tensor processing units (TPUs), or application-specific integrated circuits (ASICs). In some instances, special purpose processors may be capable of image processing, image alignment, merging images, transforming images, or reconstructing distribution, among other applications or functions. Data storage 108 may include one or more volatile and/or non-volatile storage components, such as magnetic, optical, flash, or organic storage, and may be integrated in whole or in part with processor 106. Data storage 108 may include removable and/or non-removable components.

Processor 106 may be capable of executing program instructions 118 (e.g., compiled or non-compiled program logic and/or machine code) stored in data storage 108 to carry out the various functions described herein. Therefore, data storage 108 may include a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by computing device 100, cause computing device 100 to carry out any of the methods, processes, or functions disclosed in this specification and/or the accompanying drawings. The execution of program instructions 118 by processor 106 may result in processor 106 using data 112.

By way of example, program instructions 118 may include an operating system 122 (e.g., an operating system kernel, device driver(s), and/or other modules) and one or more application programs 120 (e.g., functions for executing heatmap generation algorithms) installed on computing device 100. Data 112 may include input distributions 114 and/or average distributions determined therefrom 116.

Application programs 120 may communicate with operating system 122 through one or more application programming interfaces (APIs). These APIs may facilitate, for instance, application programs 120 transmitting or receiving information via communication interface 102, receiving and/or displaying information on user interface 104, and so on.

Application programs 120 may take the form of “apps” that could be downloadable to computing device 100 through one or more online application stores or application markets (via, e.g., the communication interface 102). However, application programs can also be installed on computing device 100 in other ways, such as via a web browser or through a physical interface (e.g., a USB port) of the computing device 100.

III. Example Methods

FIG. 2 is a flowchart of a computationally efficient and tractable method 200 for generating heatmaps or related distributions from input distributions in a manner that preserves the privacy of the input distributions. The method 200 includes receiving an input distribution that is an average of a number of individual distributions, each individual distribution representing data from a respective different entity (210). The method 200 additionally includes, using a multi-scale spatial transform, transforming the input distribution into a multi-scale representation that includes two or more representations of the input distribution at respective different spatial scales (220). The method 200 additionally includes adding noise to the multi-scale representation to generate an anonymized multi-scale representation, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding a different amount of noise to each of the two or more representations of the input distribution at respective different spatial scales (230). The method 200 additionally includes, based on the anonymized multi-scale representation, reconstructing an output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the anonymized multi-scale representation (240). The method 200 could include additional or alternative features.

FIG. 3 is a flowchart of a computationally efficient and tractable method 300 for generating heatmaps or related distributions from input distributions in a manner that preserves the privacy of the input distributions. The method 300 includes receiving an input distribution that is an average of a number of individual distributions, each individual distribution representing data from a respective different entity (310). The method 300 additionally includes, using a multi-scale spatial transform, transforming the input distribution into a multi-scale representation that includes two or more representations of the input distribution at respective different spatial scales (320). The method 300 additionally includes adding noise to the multi-scale representation to generate an anonymized multi-scale representation, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding a different amount of noise to each of the two or more representations of the input distribution at respective different spatial scales (330). The method 300 additionally includes sparsifying the anonymized multi-scale representation such that no more than a specified number of elements of any of the two or more representations of the input distribution at respective different spatial scales are non-zero (340). The method 300 also includes, based on the sparsified anonymized multi-scale representation, reconstructing an output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the sparsified anonymized multi-scale representation (350). The method 300 could include additional or alternative features.

IV. Example Embodiments

Differential privacy (DP) provides a strong definition of user privacy for data aggregation and machine learning, with practical deployments including the 2022 US Census, in industry, and in popular machine learning libraries. DP algorithms have been developed for several analytic tasks involving aggregation of user data.

One of the basic data aggregation tools is a heatmap. Heatmaps are popular for visualizing aggregated data in two or higher dimensions. They can be used in a variety of applications, e.g., in computer vision and image processing, spatial data analysis, bioinformatics. Many of these applications involve protecting the privacy of user data. For example, a heatmap of popular locations in a geographic area can be determined based on user location check-ins, which represent sensitive private data. In another example, heatmaps for gene microdata or gaze can be determined based on private data from individuals.

The systems and methods described herein provide an efficient, differentially private algorithm for computing heatmaps with provable privacy guarantees. Also provided are empirical evaluations of the privacy of these methods.

Some of the systems and methods described herein provide a “primitive” for solving the task of privately aggregating sparse input vectors with a small error as measured by the Earth Mover's Distance (EMD). While often used in relation to heatmaps, the EMD measure is of independent interest: it was originally proposed for computer vision tasks since it matches perceptual similarity better than other measures such as

1,

2, or Kullbeck-Liebler (KL) divergence. It is also well-suited for spatial data analysis since it can take an underlying metric space into account and can be sensitive to “neighboring” bins. For the task of sparse aggregation under EMD, the embodiments described herein provide an efficient algorithm with asymptotically tight error.

To illustrate the systems and methods provided herein, consider the setting where each user i holds a probability distribution p_(i) over grid points in [0; 1)², and the goal is to compute the heatmap of the average of these probabilities, i.e.

$,{\frac{1}{n}{\sum_{i = 1}^{n}{p_{i}.}}}$

An ε-DP algorithm is provided for this task, its theoretical guarantees are established, and empirical evaluations of its performance are provided herein.

Generally, the embodiments described herein apply methods for sparsely aggregating the probability distributions under the EMD. In particular, it is desirable to generate an estimate for

$\frac{1}{n}{\sum_{i = 1}^{n}p_{i}}$

with error measured in the EMD. EMD is used for the error measure for at least two primary reasons. First, a bound on the EMD to the average distribution implies bounds on several metrics commonly used in evaluating heatmaps, including the KL-divergence,

1 distance, and EMD itself. Second, while it is possible to obtain DP aggregation algorithms with bounded EMD error, any DP aggregation algorithm must suffer errors under other metrics, including KL-divergence or

1 distance, that grow with the resolution, rendering them impractical in the setting where the number of users is small compared to the resolution.

When the distributions' p_(i)'s are arbitrary, a simple c-DP algorithm yields a guarantee of O_(ε)(1/√n) on EMD, and this bound is essentially optimal. While this is already a reasonable bound, the methods described herein improve on it by exploiting a property that is commonly present in distributions used for aggregations: they tend to be “sparse”.

The approximation guarantee for the sparse EMD aggregation problem is defined herein with respect to the best k-sparse distribution that approximates the average

$a:={\frac{1}{n}{\sum_{i = 1}^{n}p_{i}}}$

under EMD. More formally, an output distribution â is by definition a “λ, κ)-approximation” for sparse EMD aggregation if it satisfies

${{{EMD}\left( {\hat{a},a} \right)} \leq {{{\lambda \cdot \min\limits_{k - {{sparse}a^{\prime}}}}{{EMD}\left( {a^{\prime},a} \right)}} + \kappa}},{{where}\lambda},{\kappa > 0}$

denote the multiplicative approximation ratio and additive error respectively and a′ is any k-sparse distribution (e.g., the k-sparse distribution that minimizes EMD(a′,a)).

Under such a sparse approximation notion, an error of only O_(ε)(√{square root over (k)}/n) is achievable, and this error is tight.

Theorem 1.1 (Informal): There exists an c-DP algorithm that, for any constant λ∈(0,1), can output a (λ, O_(ε)(√{square root over (k)}/n))-approximation for sparse EMD aggregation with probability 0.99. Furthermore, no ε-DP algorithm can output a (λ,o_(ε)(√{square root over (k)}/n))-approximate solution with probability 0.1.

This result has implications on DP Clustering. Sparse EMD aggregation is intimately related to clustering—for example, the k-median problem—on the plane. Indeed, an algorithm for sparse EMD aggregation implies a coreset for k-median on the plane, which in turn can be used to derive an approximation algorithm for the problem. Accordingly, the following are implied:

Corollary 1.2. There exists an ε-DP algorithm that, for any constant (0,1), can output a (λ,O(√{square root over (k)}/ε))-coreset for k-median, where each input point belongs to [0,1)².

Corollary 1.3. There exists an ε-DP algorithm that, for any constant λ∈(0,1), can output a (1+λ,O(√{square root over (k)}/ε))-approximation for k-median, where each input point belongs to [0,1)².

Prior attempts have incurred an additive error of Ω(k), while results for the methods described herein achieve an improved O(√{square root over (k)}) additive error; this bound is also tight due to an adaptation of the lower bound for k-sparse distribution aggregation used herein. This additive error is also independent of the number n, of users, a significant benefit. Prior attempts incurred an additive error that depended on the number n of users.

The methods described herein were tested on both real-world location datasets and synthetic datasets. The results demonstrate practicality even for moderate values of c E [0.5,5] and a number of users equal to 200. The methods described herein were also compared with simple baselines. When applying popular metrics for heatmaps, the results for the methods described herein demonstrate significant improvements on these regimes of parameters.

Some implementations of the methods described herein are related to determining an underlying vector x that is known to be well-approximated by a sparse vector; a matrix A can be provided such that, when measurements Ax are observed of x, an x′ can be reconstructed that is close to x (under a certain metric). This can be accomplished trivially by taking A to, e.g., be the identity matrix. Thus, the objective is to perform this recovery task using as few measurements (i.e., number of rows of A) as possible.

Such a “compressed sensing” under EMD can be reduced to a problem under

1. Such a reduction can include finding a linear transformation that satisfies certain properties. Once such a transformation is specified, the algorithm can proceed (roughly) as follows: transform the input x, run the compressed sensing scheme for

1, and then “reverse” the transformation to get x′. Note that the number of measurements required for such a solution 0 is that of the

compressed sensing scheme.

One can apply such a method to DP aggregation by viewing the hidden vector x as the sum Σ_(i=1) ^(n)p_(i), and then add Laplacian noise to each measurement to ensure privacy. The robustness of known

1 compressed schemes can be applied to analyze the error of such a method; unfortunately, since the error will scale according to the

1 norm of the noise vector and the noise vector consists of O(k·log(n/k)) entries, this approach only provides an error guarantee of O(k·poly log n).

To overcome this limitation barrier, a difference between the goals of compressed sensing and DP aggregation can be leveraged, namely, that the former aims to minimize the number of measurements whereas the latter aims to minimize the error due to the noise added (irrespective of the number of measurements). Thus, methods from compresses sensing can be applied without compressing, e.g., measuring the entire transformation. The log n error of such modified method can be further reduced by selecting a different noise magnitude for each measurement, which allows for the O(√{square root over (k)}) error noted above to be obtained.

Determination of the lower bound follows a packing framework. Specifically, a set of k-sparse distributions was generated whose pairwise EMDs are at least Ω(1/√{square root over (k)}). This construction is based on an

1 packing of the √{square root over (k)}×√{square root over (k)} grid, which gives a set of size 2^(Ω(k)). It then follows that the error is at least Ω_(ε)(√{square root over (k)}/n) with probability 0.9.

Notational Definitions

For every non-negative integer N∈

∪{0}, [N] is written to denote {0, . . . , N}. Let G_(Δ) be the set of (Δ×Δ) grid points in [0,1)²; specifically, G_(Δ)={(i/Δ, j/Δ)|i,j ∈[Δ−1]}. For notational convenience, it is assumed throughout that Δ=2

for some

∈

.

For an index set I, p∈

^(I) is viewed as a vector indexed by I and p(i) is written to denote the value of its i^(th) 5 coordinate; this notation extends naturally to the set S∈I of coordinates, for which we let p(S):=Σ_(i∈I)p(i). Furthermore, p|s is used to denote the restriction of p to S; more formally p|s(i)=p(i) if i∈S and p|s(i)=0 otherwise. Additionally, p| _(s) is written as a shorthand for p−p|s, i.e., the restriction of p to the complement of S. supp(p) is used to denote the set of non-zero coordinates of vector p. A vector is said to be k-sparse if its support is of size at most k. Recall that the

1-norm of a vector p∈

^(I) is ∥p∥₁:=Σ_(i∈I)|p(i)|.

Given two non-negative vectors p, q∈

_(≥0) ^(G) ^(Δ) such that ∥p∥₁=∥q∥₁, their earth mover's distance (EMD) is defined as

${{EMD}\left( {p,q} \right)} = {\min\limits_{\gamma}{\sum\limits_{x \in G_{\Delta}}{\sum\limits_{y \in G_{\Delta}}{{\gamma\left( {x,y} \right)} \cdot {{{x - y}}_{1}.}}}}}$

Here, the minimum is over γ∈

_(≥0) ^(G) ^(Δ) ^(×G) ^(Δ) whose marginals are p and q. (More formally, for all x∈G_(Δ),Σ_(y∈G) _(Δ) γ(x, y)=p(′) and, for all y∈G_(Δ), Σ_(x∈G) _(Δ) γ(x, y)=q(y).

The EMD norm of a vector w∈

^(G) ^(Δ) is defined by

${{w}_{EMD}:={{\underset{{p - q + {rw}},{{p}_{1} = {q}_{1}}}{\min\limits_{p,{q \in R_{\geq 0}^{C_{\Delta}}}}}{{EMD}\left( {p,q} \right)}} + {a \cdot {r}}}},{{{where}\alpha} = 2}$

is the diameter of our space [0,1)×[0,1).

The following lemma is useful when dealing with unnormalized vs normalized vectors.

Lemma 2.1. Suppose that s,s∈

_(≥0) ^(Gis Δ) are such that ∥s∥₁=n and ∥s−ŝ∥_(EMD)≤n/2. Let a=s/∥s∥₁ and â=ŝ/∥ŝ∥₁. Then, ∥a−{dot over (a)}∥_(EMD)≥4∥s−ŝ∥_(EMD)/n.

Two input datasets X and X′ are said to be neighbors if X′ results from adding or removing a single user's data from X. In the present setting, each user i s data is a probability distribution p_(i) over G_(Δ).

Definition 2.1 (Differential Privacy). A mechanism M is said to be ε-DP if and only if, for every set O of outputs and every pair X,X′ of neighboring datasets, Pr[M(X)∈O] ≥e^(ε). Pr[M(X′)∈O].

Definition 2.2 (Laplace Mechanism). For any vector-valued function ƒ, its

1-sensitivity, denoted by S₁(ƒ), is defined as max_(neighboring X, X′)∥∫(X)−∫(X′)∥₁. The Laplace mechanism with parameter b>0 adds an independent noise drawn from the Laplace distribution Lap(b) to each coordinate of ƒ.

Lemma 2.2. The Laplace mechanism with parameter S₁(ƒ)/ε is ε-DP.

Given p∈

_(≥0) ^(G) ^(Δ) , its associated heatmap with Gaussian filter variance σ² is defined as

${{H_{p}^{\sigma}\left( {x,y} \right)} = {\sum\limits_{{({x^{\prime},y^{\prime}})} \in G_{\Delta}}{\frac{1}{Z\left( {x^{\prime},y^{\prime}} \right)}{e^{- \frac{{({x - x^{\prime}})}^{2} + {({y - y^{\prime}})}^{2}}{2\sigma^{2}}} \cdot {p\left( {x^{\prime},y^{\prime}} \right)}}{for}{all}}}}{{\left( {x,y} \right) \in G_{\Delta}},{{{where}{Z\left( {x^{\prime},y^{\prime}} \right)}}:=\sum_{{({x^{\prime},y^{\prime}})} \in {G_{\Delta}c^{- \frac{{({x - x^{\prime}})}^{2} + {({y - y^{\prime}})}^{2}}{2\sigma^{2}}}}}}}$

is the normalization factor.

In the heatmap aggregation problem over n users, each user i has a probability distribution p_(i) over G_(Δ). The goal is to output an estimate of the aggregated heatmap H_(a) ^(σ) where

$a = {\frac{1}{n}{\sum_{i \in {\lbrack n\rbrack}}{P_{i}.}}}$

Example Algorithm

In this section, an example implementation of a private, sparse EMD aggregation algorithm as described herein is described.

Theorem 3.1. For any ε>0 and λ∈(0,1), there is an ε-DP algorithm that can, with probability 0.99, output a

$\left( {\lambda,{O\left( \frac{\sqrt{k}}{\lambda\varepsilon n} \right)}} \right)$

-approximation for the k-sparse EMD aggregation problem.

As alluded to above, a linear transformation can be used. In particular, the so-called (scaled) pyramidal transform, whose variant is also often used in (metric) embedding of EMD to

1, can be used as the linear transform. Roughly speaking, the transform represents a hierarchical partitioning of [0,1)² into subgrids, where a subgrid at a level is divided into four equal subgrids at the next level. The (scaled) pyramidal transform has one row corresponding to each subgrid; the row is equal to the indicator vector of the subgrid scaled by its side length. These are formalized below.

Definition 3.1 (Grid Cells). For i∈

∪{0}, let C₂ _(i) denote the set of level i grid cells defined as C₂ _(i) :={[a,a+2^(−i))×[b,b+^(2-i))

(à, {acute over (b)})∈G₂ _(i) }; let m_(i):=|C₂ _(i) |.

Definition 3.2 ((Scaled) Pyramidal Transform). For i∈[

], the level-i grid partition map is defined as the matrix P_(i)∈{0,1}^(C) ² ^(i) ^(×G) ^(Δ) where P_(i)(c,p)=1 if and only if p∈c. The (scaled) pyramidal transform is the matrix P∈

∪_(i=0)

C₂ _(i) ×G_(Δ) defined by P:=[P₀ ^(T)2⁻¹P_(l) ^(T) . . . 2⁻

P

^(T)]^(T).

The example implementation of an algorithm for sparse EMD aggregation includes two components. The first component, presented in Algorithm 1 below, first aggregates the input distributions (Line 3) and applies the pyramidal transform to the aggregate, adding different amounts of Laplace noise for different levels of the grid (Lines 5, 6). (The parameters ε1, . . . , ε

, which govern the amount of added Laplace noise, will be specified in the next subsection.) The second component, presented in Algorithm 2 below, takes these noisy measurements for every level of the grid and reconstructs the solution by first recovering the

₁ solution (Line 8) and then the EMD solution using a linear program (Line 9).

These implementations provide at least the following two improvements: first, noise is added to the measurements and, secondly, there is no explicit “compression” performed, such compression taking a wide matrix A for for

₁ recovery and multiplies it with Ps.

Algorithm 1: DPSparseEMDAgg

1: Input: distributions p₁, . . . , p_(n) on G_(Δ)

2: Parameters: ε₁, . . . , ε

>0, w∈

3: s←Σ_(i=1) ^(n)p_(i)

4: for i=0, . . . ,

do

5: v_(i)←Lap(1/ε_(i))^(⊗m) ^(i)

6:

$\left. y_{i}^{\prime}\leftarrow{\frac{1}{2^{i}}\left( {{P_{i}s} + v_{i}} \right)} \right.$

7: end for

8: y′←[y′₀ . . . y′

]

9: ŝ←Reconstruct(y′;w)

10: return â:=ŝ/∥ŝ∥

Algorithm 2: Reconstruct

1: Input: noisy measurements y′∈

2: Parameters: w∈

3: S₀←C₁

4: for i=0, . . . ,

do

5: T_(i)←children(S_(i-1))

6: S_(i)←the set of min{w, |T_(i)|} coordinates in T_(i) with maximum values in y′

7: end for

8: S←

S_(i) and ŷ←y′|_(s)

9: return ŝ←arg min_(s′≥0)∥ŷ−Ps′∥₁

This “recovered” ŷ is close, in the

1 metric, to the true value of Ps. Additionally, the outputs is close, in EMD, to s, due in part to the properties of P. Since noise is added to the measurement, the analysis is extended to be robust to noise. Finally, the privacy parameters ε₁, . . . , ε

are determined to finish the proof of Theorem 3.1.

The additive error bound for ŝ is O_(ε,λ)(√{square root over (k)}) (which ultimately gives the O_(ε,λ)(√{square root over (k)}/n) error bound for the normalized â). w=O_(λ)(k) was selected so as to have an additive error of O_(ε)(√{square root over (w)}). At a high level, each noise

$\frac{1}{2^{i}} \cdot {v_{i}(t)}$

added to a “queued” term y_(i)(t) for t∈T_(i) permeates to an error of the same order. For simplicity, assume for the moment that |v_(i)(t)|=O(1/ε_(λ)). Now, notice that at level i<log √w, then |T_(i)|=C₂ _(i) |=2^(2i) and thus the total error contribution of this level is O(2^(i)/ε_(i)). On the other hand, for a level i≥log √{square root over (w)}, |T_(i)|=w and the error contribution is

${O\left( \frac{w}{2 \cdot K_{t}} \right)}.$

When i=log √{square root over (w)}±O(1), these error terms are O(√{square root over (w)}/ε_(i)) and thus ε_(i) should be set equal to Ω(1) to get the desired bound. However, in terms of |i−log √{square root over (w)}|, these error contributions become exponentially smaller, i.e.,

${O\left( \frac{\sqrt{w}}{2^{❘{i - {\log\sqrt{w}}}❘}\varepsilon_{i}} \right)}.$

Thus, the ε_(i) can be set proportional to γ^(|i-log √{square root over (w)}|) for some constant γ>0.5. This results in the desired O_(ε)(√{square root over (w)})=O_(ε,λ)(√{square root over (k)}) bound.

Phase I:

1 Recovery. The

1 recovery guarantee of ŷ is analysed in this section. The recovery algorithm used herein, which is an adaptation of Indyk and Price's k-median clustering, model-based compressive sensing, and sparse recovery for earth mover distance in STOC, pages 627-636, 2011, works well for hidden vectors that follow a certain “tree-like structure”, defined formally below.

Definition 3.3. For i≥1, a grid cell c′∈C₂ _(i) is said to be a child of grid cell c∈C₂ _(i-1) if c⊆c′. This forms a tree rooted at [0,1)×[0,1)∈C₀ where every internal node has exactly four children. Letting T_(w) denote the set of all trees such that the number of nodes at each level is at most w.

Let M_(w) denote the set of y=[y₀, . . . , y

] where y_(i)∈

_(≥0) ^(C) ² ^(i) such that

1. supp(y)∈T for some tree T∈T_(w).

2. For all i∈┌

−1┐, p∈C₂ _(i) , the following holds: y(p)≥2·y(children(p)).

Under the above analyses, the

1 recovery analysis of Indyk and Price in the no-noise case can be applied to the regime described herein, where the noise shows up as an error:

Lemma 3.2. Let y*∈

∥Ps−y∥₁ where supp(y*)⊆T* for some T*∈T_(w); let T_(i)* denote T*∩C₂ _(i) and V_(i)=T_(i)*\S_(i) for all i∈[

]. Then, ŷ on Line 8 of Reconstruct satisfies

${{\hat{y} - {Ps}}}_{1} \leq {{3{{y^{*} - {Ps}}}_{1}} + {{O\left( {\sum\limits_{i \in {\lbrack\ell\rbrack}}{\frac{1}{2^{i}}{{v_{i}❘{v_{i}\bigcup s_{i}}}}_{1}}} \right)}.}}$

Phase II: From

1 to EMD. We now proceed to bound the EMD error.

Lemma 3.3. Let the notation be as in Lemma 3.2. For any η′∈(0,1), by setting w=O(k/(η′)²), the output ŝ of Reconstruct satisfies:

${{s - s^{*}}}_{EMD} \leq {{{\eta^{\prime} \cdot \min\limits_{k - {{sparse}s^{\prime}}}}{{s - s^{\prime}}}_{EMD}} + {{O\left( {\sum\limits_{i \in {\lbrack\ell\rbrack}}{\frac{1}{2^{i}}{{v_{i}❘{v_{i}\bigcup s_{i}}}}_{1}}} \right)}.}}$

Finishing the Proof the privacy parameters can thus be selected so as to complete the proof of Theorem 3.1. Let w=O(k/(η′)²) be as in Lemma 3.3 with η′=λ/4, and let q=└log₂√{square root over (w)}┘. Let γ=0.8 be the “decay rate” for the ε_(i)'s, and let Z=Σ_(i=0)

γ^(|i-q|)≤O(1) be the normalization factor. Algorithm 1 us then executed with ε_(i)=γ^(|i-q|)·ε/Z.

Privacy Analysis. One can view the i^(th) iteration of the algorithm as releasing 2^(i)y′_(i)=P_(i)s+v_(i). Since each p_(i) has

1 norm at most one, its sensitivity with respect to P_(i)s is at most one; thus, Lemma 2.2 implies that the i^(th) iteration is ε-DP. As a result, by basic composition theorem of DP, one can conclude that releasing all of y′₀, . . . , y′

is (ε₀+ . . . +ε

)−DP. Since the reconstruction is a post-processing step, the post-processing property of DP ensures that Algorithm 1 is (ε₀+ . . . +ε

)−DP. Finally, observe that by the definition of the ε_(i)'s, ε₀+ . . . +ε

=ε as desired.

Utility Analysis. Applying Lemma 3.3, one can conclude that

${{{s - s^{*}}}_{EMD} \leq {{{\eta^{\prime} \cdot \min_{k - {{sparse}s^{\prime}}}}{{s - s^{\prime}}}_{EMD}} + \xi_{t}}},{{where}^{\xi = {O({\sum_{i \in {\lbrack\ell\rbrack}}{\overset{\_}{\frac{1}{2_{i}}}{{v_{i}❘{v_{i}\bigcup s_{i}}}}_{1}}})}}.}$

Recall that each of V_(i), S_(i)'s is of size at most max {w,2 ^(2i)} (because of the definition of M_(w) and the fact that m_(i)=|C₂ _(i) |=2^(2i)), and that each entry of v_(i) is sampled from Lap(1/ε_(i)). As a result,

${{{{\mathbb{E}}\lbrack\xi\rbrack} \leq {0\left( {\sum\limits_{i \in {\lbrack\ell\rbrack}}{{\frac{1}{2^{i}} \cdot \max}{\left\{ {w,2^{2i}} \right\} \cdot \frac{1}{\varepsilon_{i}}}}} \right)}} = {{{O\left( {\sum\limits_{i \in {\lbrack q\rbrack}}\frac{2^{i}}{\gamma^{q - i}\varepsilon}} \right)} + {O\left( {\sum\limits_{i \in {\{{{q + 1},\ldots,\ell}\}}}{\frac{k}{\lambda^{2}} \cdot \frac{1}{2^{i}\gamma^{i - q}\varepsilon}}} \right)}} = {{{O\left( \frac{2^{q}}{\varepsilon} \right)} + {O\left( {\frac{k}{\lambda^{2}} \cdot \frac{1}{2^{q}} \cdot \frac{1}{\varepsilon}} \right)}} = {O\left( \frac{\sqrt{k}}{\lambda\varepsilon} \right)}}}},$

where the last line follows from the choice of γ>0.5 and q=└log₂√{square root over (w)}┘.

Hence, by Markov's inequality, with probability 0.99,

${{{s - s^{*}}}_{EMD} \leq {{{\eta^{\prime} \cdot \min\limits_{k - {{sparse}s^{\prime}}}}{{s - s^{\prime}}}_{EMD}} + {100{{\mathbb{E}}\lbrack\xi\rbrack}}}} = {{{\eta^{\prime} \cdot \min\limits_{k - {{sparse}s^{\prime}}}}{{s - s^{\prime}}}_{EMD}} + {{O\left( \frac{\sqrt{k}}{\lambda\varepsilon} \right)}.}}$

Finally, application of Lemma 2.1 concludes the proof.

Experiments

Performance of the example algorithms described herein was assessed on two real-world location datasets.

Implementation Details. Algorithm 1 was implemented with a minor modification: measurement was not performed at the level i<└log √{square root over (w)}┘. In other words, the algorithm was begun directly at the lowest level for which the number of grid cells is at most √w. It is possible to adjust the proof to show that, even with this modification, the error remains O_(ε)(√{square root over (k)}). Apart from this, the algorithm was implemented as described above. Note that the linear program at the end of Algorithm 2 can be formulated so that the number of variables is only O(w

); the reason is that only one variable per cell is needed to be left out at each stage. This allows for efficient solution even when the resolution Δ=

is large.

As for the parameters, the decay rate γ=1/√2 was used, which was obtained by minimizing the second error term in the proof of Theorem 3.1 as

→∞. w=20 was used in the experiments, which worked well for the datasets considered.

Datasets. Two datasets available at snap.stanford.edu were used to generate the input distribution for users. The first dataset, called GOWALLA, consists of location check-ins by users of the location based social network Gowalla. Each record consists of, among other things, an anonymized user id together with the latitude (lat) and longitude (lon) of the check-in and a timestamp. This dataset was filtered to consider only check-ins roughly in the continental US for the month of January 2010; this resulted in 196,071 check-ins corresponding to 10,196 users. The second dataset, called BRIGHTKITE, also contains check-ins from a different and now defunct location-based social network Brightkite; each record is similar to GOWALLA. Once again, this dataset was filtered to consider only check-ins in the continental US for the months of November and December 2008; this resulted in 304,608 check-ins corresponding to 10,177 users.

For each of these datasets, the whole area was partitioned into a 300×300 grid. The top 30 cells (in both datasets combined) were then preserved that had have the most check-ins. (Each of the 30 cells was mostly around some city like New York, Austin, etc, and had check-ins from at least 200 unique users). Each cell was then partitioned into Δ×Δ subgrids and each check-in was snapped to one of these subgrids.

Metrics. To evaluate the quality of an output heatmap ĥ compared to the true heatmap h, the following commonly used metrics were determined: Similarity, Pearson coefficient, KL-divergence, and EMD. Note that the first two metrics should increase as ĥ, h are more similar, whereas the latter two should decrease.

Baselines. As a baseline for comparison a recently proposed algorithm was applied, where Laplace noise was added to each subgrid cell of the sum s, any negative cells were zeroed out, and the heatmap was produced from this noisy aggregate. A “thresholding” variant of this baseline was also considered that was more suited to sparse data: only keeping the top t % of the cell values after noising (and zeroing out the rest).

In the first set of experiments, Δ was fixed to 256. For each ε∈{0.1, 0.5, 1, 2, 5, 10}, the algorithms were run together with the baseline and its variants on all 30 cells, with 2 trials for each cell. In each trial, a set of 200 users was sampled and used to run all the algorithms; the distance metrics were then computed between the true heatmap and the estimated heatmap. The average of these metrics over the 60 runs is presented in FIG. 4 , together with the 95% confidence interval (shading). As can be seen in FIG. 4 , the baseline has rather poor performance across all metrics, even for large ε=10. Several values oft were evaluated for the thresholding variant, yielding a significant improvement. Despite this, the algorithm described herein consistently out-performed the alternatives across all metrics. These improvements are especially significant when E is not too large or too small (e.g., 0.2≤ε≤5).

In the second set of experiments, the effect of varying the number n of users was studied. By fixing a single cell (that contains>500 users) and ε, n was swept across {50, 100, 200, 300, 400, 500} users. For each value of n, 10 trials were run and their results averaged. As predicted by theory, it was observes that the algorithms described herein and the original baseline performed better as n increased. However, the behavior of the thresholding variants of the baseline method were less predictable, and sometimes the performance degraded with a larger number of users. It seems plausible that a larger number of users cause an increase in the sparsity, which after some point makes the simple thresholding approach unsuited for the data.

Another set of experiments were also run a single cell and E were fixed, and the resolution Δ was varied across {64, 128, 256}. In agreement with theory, the algorithm described herein had a utility that remained nearly constant for the entire range of Δ. On the other hand, the original baseline method suffered across all metrics as Δ increased. The effects on the thresholding variants of the baseline method were more subtle; they occasionally improved as Δ increased, which might be attributed to the fact that when Δ is small, thresholding can zero out too many subgrid cells.

FIG. 5 presents results from the latter two sets of experiments under the EMD metric when ε=10. Examples of the heatmaps resulting from the experiments according to each approach are depicted in FIG. 6 (ε=1 (top) and ε=5 (bottom)). The algorithms from left to right in FIG. 6 are: original heatmap (no privacy), baseline method, baseline method with top 0.01% thresholding, and the method described herein.

V. Conclusion

The particular arrangements shown in the Figures should not be viewed as limiting. It should be understood that other embodiments may include more or less of each element shown in a given Figure. Further, some of the illustrated elements may be combined or omitted. Yet further, an exemplary embodiment may include elements that are not illustrated in the Figures.

Additionally, while various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are contemplated herein. 

What is claimed is:
 1. A computer-implemented method comprising: receiving an input distribution that is an average of a number of individual distributions, each individual distribution representing data from a respective different entity; using a multi-scale spatial transform, transforming the input distribution into a multi-scale representation that includes two or more representations of the input distribution at respective different spatial scales; adding noise to the multi-scale representation to generate an anonymized multi-scale representation, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding a different amount of noise to each of the two or more representations of the input distribution at respective different spatial scales; and based on the anonymized multi-scale representation, reconstructing an output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the anonymized multi-scale representation.
 2. The computer-implemented method of claim 1, wherein the input distribution is a two-dimensional distribution.
 3. The computer-implemented method of claim 1, further comprising: receiving the individual distributions, wherein receiving the input distribution comprises determining a mean of the individual distributions.
 4. The computer-implemented method of claim 1, wherein the individual distributions are indicative of gaze directions.
 5. The computer-implemented method of claim 1, wherein the individual distributions are indicative of cursor locations.
 6. The computer-implemented method of claim 1, wherein the multi-scale spatial transform is a pyramidal transform.
 7. The computer-implemented method of claim 1, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding Laplacian-distributed noise to each of the two or more representations of the input distribution at respective different spatial scales such that the variance of the Laplacian-distributed noise varies according to the spatial scales of the two or more representations of the input distribution at respective different spatial scales.
 8. The computer-implemented method of claim 7, wherein the variance of the Laplacian-distributed noise decreases from one spatial scale to the next according to a decay parameter, and wherein the decay parameter has a value between 0.5 and 1.0.
 9. The computer-implemented method of claim 1, wherein reconstructing the output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the anonymized multi-scale representation comprises determining the output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in a reduced difference between the anonymized multi-scale representation and the output distribution when transformed using the multi-scale spatial transform.
 10. The computer-implemented method of claim 9, wherein the reduced difference between the anonymized multi-scale representation and the output distribution when transformed using the multi-scale spatial transform is a reduced

distance between the anonymized multi-scale representation and the output distribution when transformed using the multi-scale spatial transform.
 11. The computer-implemented method of claim 9, further comprising: prior to reconstructing the output distribution, sparsifying the anonymized multi-scale representation such that no more than a specified number of elements of any of the two or more representations of the input distribution at respective different spatial scales are non-zero.
 12. The computer-implemented method of claim 1, further comprising: generating a heat map based on the output distribution.
 13. A computer-implemented method comprising: receiving an input distribution that is an average of a number of individual distributions, each individual distribution representing data from a respective different entity; using a multi-scale spatial transform, transforming the input distribution into a multi-scale representation that includes two or more representations of the input distribution at respective different spatial scales; adding noise to the multi-scale representation to generate an anonymized multi-scale representation, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding a different amount of noise to each of the two or more representations of the input distribution at respective different spatial scales; sparsifying the anonymized multi-scale representation such that no more than a specified number of elements of any of the two or more representations of the input distribution at respective different spatial scales are non-zero; and based on the sparsified anonymized multi-scale representation, reconstructing an output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the sparsified anonymized multi-scale representation.
 15. The computer-implemented method of claim 13, further comprising: receiving the individual distributions, wherein receiving the input distribution comprises determining a mean of the individual distributions.
 15. The computer-implemented method of claim 13, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding Laplacian-distributed noise to each of the two or more representations of the input distribution at respective different spatial scales such that the variance of the Laplacian-distributed noise varies according to the spatial scales of the two or more representations of the input distribution at respective different spatial scales.
 16. The computer-implemented method of claim 15, wherein the variance of the Laplacian-distributed noise decreases from one spatial scale to the next according to a decay parameter, and wherein the decay parameter has a value between 0.5 and 1.0.
 17. The computer-implemented method of claim 13, wherein reconstructing the output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the anonymized multi-scale representation comprises determining the output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in a reduced difference between the anonymized multi-scale representation and the output distribution when transformed using the multi-scale spatial transform.
 18. The computer-implemented method of claim 17, wherein the reduced difference between the anonymized multi-scale representation and the output distribution when transformed using the multi-scale spatial transform is a reduced

distance between the anonymized multi-scale representation and the output distribution when transformed using the multi-scale spatial transform.
 19. A computing device comprising: one or more processors, wherein the one or more processors are configured to perform operations comprising: receiving an input distribution that is an average of a number of individual distributions, each individual distribution representing data from a respective different entity; using a multi-scale spatial transform, transforming the input distribution into a multi-scale representation that includes two or more representations of the input distribution at respective different spatial scales; adding noise to the multi-scale representation to generate an anonymized multi-scale representation, wherein adding noise to the multi-scale representation to generate the anonymized multi-scale representation comprises adding a different amount of noise to each of the two or more representations of the input distribution at respective different spatial scales; and based on the anonymized multi-scale representation, reconstructing an output distribution such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the anonymized multi-scale representation.
 20. The computing device of claim 19, wherein the operations further comprise: sparsifying the anonymized multi-scale representation such that no more than a specified number of elements of any of the two or more representations of the input distribution at respective different spatial scales are non-zero, wherein reconstructing the output distribution comprises reconstructing the output distribution based on the sparsified anonymized multi-scale representation such that the output distribution, when transformed using the multi-scale spatial transform, results in an approximation of the anonymized multi-scale representation. 